
ARROW-13330: [Go][Parquet] Add the rest of the Encoding package #10716

Closed
zeroshade wants to merge 5 commits from the parquet-encoding-p2 branch

Conversation

zeroshade
Member

@emkornfield Thanks for merging the previous PR #10379

Here are the remaining files that we pulled out of that PR to shrink it down, including all the unit tests for the Encoding package.

@zeroshade
Member Author

@emkornfield @sbinet Bump for visibility to get reviews

@emkornfield
Contributor

Sorry, busy week this week; I'll try to get to it by EOW or sometime next week.

@zeroshade
Member Author

@emkornfield bump

@emkornfield
Contributor

Sorry, I've had less time than I would have liked recently for Arrow reviews; I'll try to get to this soon.

)

func BenchmarkPlainEncodingBoolean(b *testing.B) {
	for sz := MINSIZE; sz < MAXSIZE+1; sz *= 2 {
Contributor

Isn't there a built-in construct in Go benchmarks for adjusting batch size?

Member Author

Nope. b.N is the number of iterations to run under the benchmarking timer, so this little loop creates a separate sub-benchmark for each of the batch sizes.
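
For reference, the pattern looks roughly like this (a minimal sketch with hypothetical MINSIZE/MAXSIZE values and a stand-in encoder function; the real benchmark calls into the encoding package instead):

```go
package encoding_test

import (
	"fmt"
	"testing"
)

const (
	MINSIZE = 1024
	MAXSIZE = 65536
)

// plainEncodeBools is a hypothetical stand-in for the real PLAIN boolean encoder.
func plainEncodeBools(vals []bool) []byte {
	out := make([]byte, 0, len(vals))
	for _, v := range vals {
		if v {
			out = append(out, 1)
		} else {
			out = append(out, 0)
		}
	}
	return out
}

func BenchmarkPlainEncodingBoolean(b *testing.B) {
	for sz := MINSIZE; sz < MAXSIZE+1; sz *= 2 {
		// each batch size gets registered as its own sub-benchmark; inside it,
		// b.N is the iteration count chosen by the benchmark harness
		b.Run(fmt.Sprintf("len %d", sz), func(b *testing.B) {
			values := make([]bool, sz)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				plainEncodeBools(values)
			}
		})
	}
}
```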

}

// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
// the buffer and return it yet, appending these encoded values. Returns the number
Contributor

Does it ever return fewer than the number of values provided in lvls? If so, please document it (if not, maybe still note that this is simply for the API consumer's convenience).

Member Author

It should always return len(lvls); if it returns fewer, that means it encountered an error while encoding. I'll add that to the comment.

Contributor

So it is up to users to check that? Should it propagate an error instead?

Member Author

Currently this is an internal-only package, so it isn't exposed for non-internal consumers to call; the column writers check the return value and propagate an error if it doesn't match. Alternately, I could modify the underlying encoders to have Put return an error instead of just a bool and propagate that. I believe it currently returns true/false for success/failure out of convenience, and I never got around to having it return a proper error.

I'll take a look at how big such a change would be.

Member Author

Updated this change to add error propagation to all the necessary spots (and all the subsequent calls and dependencies), so consumers no longer have to rely on checking the number of values returned and can easily see if an error was returned by the encoders.
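
Roughly, the difference in calling convention looks like this (a sketch with hypothetical interface shapes and error text, not the exact signatures in the package):

```go
package levels

import "errors"

// old style: success is signalled only by the returned count
type countReturningEncoder interface {
	EncodeNoFlush(lvls []int16) int
}

// new style: the encoder propagates an error directly
type errorReturningEncoder interface {
	EncodeNoFlush(lvls []int16) (int, error)
}

// before: the column writer has to compare the count against len(lvls)
func writeLevelsOld(enc countReturningEncoder, lvls []int16) error {
	if n := enc.EncodeNoFlush(lvls); n != len(lvls) {
		return errors.New("level encoding failed")
	}
	return nil
}

// after: any encoding error surfaces without count checking
func writeLevelsNew(enc errorReturningEncoder, lvls []int16) error {
	_, err := enc.EncodeNoFlush(lvls)
	return err
}
```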

@zeroshade
Member Author

@emkornfield bump. I've responded to and/or addressed all of the comments so far. 😄


func TestDeltaByteArrayEncoding(t *testing.T) {
	test := []parquet.ByteArray{[]byte("Hello"), []byte("World"), []byte("Foobar"), []byte("ABCDEF")}
	expected := []byte{128, 1, 4, 4, 0, 0, 0, 0, 0, 0, 128, 1, 4, 4, 10, 0, 1, 0, 0, 0, 2, 0, 0, 0, 72, 101, 108, 108, 111, 87, 111, 114, 108, 100, 70, 111, 111, 98, 97, 114, 65, 66, 67, 68, 69, 70}
Contributor

Where do the expected values come from?

Member Author

The Delta Byte Array encoding was already implemented in another Parquet library, so I stole the expected values from their unit tests and also did some manual confirmation to make sure, to the best of my knowledge, that they are correct.

	Reset()
	// Size returns the current number of unique values stored in the table,
	// including whether or not a null value has been passed in using GetOrInsertNull
	Size() int
Contributor

Is int 32 bits here? Does it pay to have errors returned from the insertion operations if they exceed that range?

Member Author

int is implementation-defined: on 32-bit platforms it will be 32 bits, and on 64-bit platforms it will be 64 bits.

In this memo table, I assumed it was unlikely there'd ever be billions of elements, so a check for exceeding the range of an int didn't seem necessary. Personally, I'd prefer not to add the extra check inside the insertion operation: it's a critical path that is likely to sit inside a tight loop, so rather than checking whether the new size will exceed math.MaxInt on every insert, I'd rather just document that the memo table is limited to MaxInt unique values. Thoughts?
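
For illustration, the documentation-only approach might look roughly like this (hypothetical wording and method signature, just to show the trade-off):

```go
package memo

// MemoTable is an illustrative sketch of the interface under discussion.
type MemoTable interface {
	// GetOrInsert returns the index of val, inserting it if it is not already
	// present. The table is limited to math.MaxInt unique values; since
	// insertion sits on a hot path inside tight encoding loops, no overflow
	// check is performed here.
	GetOrInsert(val interface{}) (idx int, found bool, err error)
	// Size returns the current number of unique values stored in the table,
	// including a null value inserted via GetOrInsertNull.
	Size() int
}
```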

	MemoTable
	// ValuesSize returns the total number of bytes needed to copy all of the values
	// from this table.
	ValuesSize() int
Contributor

Does this include the overhead for the string lengths?

Member Author

No, it's just the raw bytes of the strings as they're stored, like in an Arrow array (I back the binary memo table with a BinaryBuilder and just call DataLen on it). This is specifically used for copying the raw values out as a single chunk of memory, which is why the offsets are stored and copied out separately.
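
In sketch form (illustrative struct and constructor names; the real table also maintains the hash lookup, this just shows the BinaryBuilder/DataLen relationship described above):

```go
package memo

import (
	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

// BinaryMemoTable sketch: the values live in a BinaryBuilder exactly as they
// would in an Arrow binary array (raw bytes plus separate offsets).
type BinaryMemoTable struct {
	builder *array.BinaryBuilder
}

func NewBinaryMemoTable(mem memory.Allocator) *BinaryMemoTable {
	return &BinaryMemoTable{builder: array.NewBinaryBuilder(mem, arrow.BinaryTypes.Binary)}
}

// ValuesSize is just the raw value bytes (BinaryBuilder.DataLen), with no
// per-string length overhead, so the values can be copied out as one
// contiguous chunk; the offsets are stored and copied out separately.
func (t *BinaryMemoTable) ValuesSize() int { return t.builder.DataLen() }
```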

@emkornfield
Contributor

OK, I've gone through most of it now. I'm a bit surprised a custom hash table implementation performs best (I'd expected Go's implementation to be highly tuned); any theories as to why?

@zeroshade
Member Author

@emkornfield The performance difference is actually entirely based on the data and types (which is why I left both implementations in here).

My current theory is that the custom hash table implementation (which I took from the C++ memo table implementation) simply handles smaller values in an optimized way, resulting in significantly fewer allocations.

With go1.16.1 on my local laptop:

goos: windows
goarch: amd64
pkg: github.com/apache/arrow/go/parquet/internal/encoding
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkMemoTableFloat64/100_unique_n_65535/go_map-12         	     100	  12464686 ns/op	    7912 B/op	      15 allocs/op
BenchmarkMemoTableFloat64/100_unique_n_65535/xxh3-12           	     230	   5325610 ns/op	    7680 B/op	       2 allocs/op
BenchmarkMemoTableFloat64/1000_unique_n_65535/go_map-12        	      85	  14819479 ns/op	  123337 B/op	      70 allocs/op
BenchmarkMemoTableFloat64/1000_unique_n_65535/xxh3-12          	     186	   5388082 ns/op	  130560 B/op	       4 allocs/op
BenchmarkMemoTableFloat64/5000_unique_n_65535/go_map-12        	      79	  16167963 ns/op	  493125 B/op	     195 allocs/op
BenchmarkMemoTableFloat64/5000_unique_n_65535/xxh3-12          	     132	   7631678 ns/op	  523776 B/op	       5 allocs/op
BenchmarkMemoTableInt32/100_unique_n_65535/xxh3-12             	     247	   4648034 ns/op	    5120 B/op	       2 allocs/op
BenchmarkMemoTableInt32/100_unique_n_65535/go_map-12           	     602	   1885666 ns/op	    6468 B/op	      17 allocs/op
BenchmarkMemoTableInt32/1000_unique_n_65535/xxh3-12            	    1429	    895664 ns/op	   87040 B/op	       4 allocs/op
BenchmarkMemoTableInt32/1000_unique_n_65535/go_map-12          	     632	   1893139 ns/op	  104802 B/op	      72 allocs/op
BenchmarkMemoTableInt32/5000_unique_n_65535/xxh3-12            	    1020	   1295938 ns/op	  349184 B/op	       5 allocs/op
BenchmarkMemoTableInt32/5000_unique_n_65535/go_map-12          	     489	   2457437 ns/op	  420230 B/op	     187 allocs/op
BenchmarkMemoTable/100_unique_len_32-32_n_65535/xxh3-12        	     398	   2990361 ns/op	   16904 B/op	      23 allocs/op
BenchmarkMemoTable/100_unique_len_32-32_n_65535/go_map-12      	     428	   2811322 ns/op	   19799 B/op	      28 allocs/op
BenchmarkMemoTable/100_unique_len_8-32_n_65535/xxh3-12         	     356	   3463212 ns/op	   16904 B/op	      23 allocs/op
BenchmarkMemoTable/100_unique_len_8-32_n_65535/go_map-12       	     380	   3149174 ns/op	   19781 B/op	      28 allocs/op
BenchmarkMemoTable/1000_unique_len_32-32_n_65535/xxh3-12       	     366	   3188208 ns/op	  176200 B/op	      32 allocs/op
BenchmarkMemoTable/1000_unique_len_32-32_n_65535/go_map-12     	     361	   9588561 ns/op	  211407 B/op	      67 allocs/op
BenchmarkMemoTable/1000_unique_len_8-32_n_65535/xxh3-12        	     336	   3529636 ns/op	  176200 B/op	      32 allocs/op
BenchmarkMemoTable/1000_unique_len_8-32_n_65535/go_map-12      	     326	   3676726 ns/op	  211358 B/op	      67 allocs/op
BenchmarkMemoTable/5000_unique_len_32-32_n_65535/xxh3-12       	     252	   4702015 ns/op	  992584 B/op	      42 allocs/op
BenchmarkMemoTable/5000_unique_len_32-32_n_65535/go_map-12     	     255	   4884533 ns/op	 1131141 B/op	     181 allocs/op
BenchmarkMemoTable/5000_unique_len_8-32_n_65535/xxh3-12        	     235	   4814247 ns/op	  722248 B/op	      41 allocs/op
BenchmarkMemoTable/5000_unique_len_8-32_n_65535/go_map-12      	     244	   5340692 ns/op	  860673 B/op	     179 allocs/op
BenchmarkMemoTableAllUnique/values_1024_len_32-32/go_map-12    	    4509	    271040 ns/op	  211432 B/op	      67 allocs/op
BenchmarkMemoTableAllUnique/values_1024_len_32-32/xxh3-12      	    8245	    150411 ns/op	  176200 B/op	      32 allocs/op
BenchmarkMemoTableAllUnique/values_1024_len_8-32/go_map-12     	    4552	    255188 ns/op	  211443 B/op	      67 allocs/op
BenchmarkMemoTableAllUnique/values_1024_len_8-32/xxh3-12       	    9828	    148377 ns/op	  176200 B/op	      32 allocs/op
BenchmarkMemoTableAllUnique/values_32767_len_32-32/go_map-12   	     100	  11416073 ns/op	 6324082 B/op	    1176 allocs/op
BenchmarkMemoTableAllUnique/values_32767_len_32-32/xxh3-12     	     222	   5578530 ns/op	 3850569 B/op	      49 allocs/op
BenchmarkMemoTableAllUnique/values_32767_len_8-32/go_map-12    	     100	  11123497 ns/op	 6323152 B/op	    1171 allocs/op
BenchmarkMemoTableAllUnique/values_32767_len_8-32/xxh3-12      	     212	   6094342 ns/op	 3850569 B/op	      49 allocs/op
BenchmarkMemoTableAllUnique/values_65535_len_32-32/go_map-12   	      44	  25062816 ns/op	12580560 B/op	    2384 allocs/op
BenchmarkMemoTableAllUnique/values_65535_len_32-32/xxh3-12     	      74	  15704849 ns/op	10430028 B/op	      53 allocs/op
BenchmarkMemoTableAllUnique/values_65535_len_8-32/go_map-12    	      40	  26667808 ns/op	12580008 B/op	    2381 allocs/op
BenchmarkMemoTableAllUnique/values_65535_len_8-32/xxh3-12      	      81	  14417075 ns/op	10430024 B/op	      53 allocs/op

The ones labeled xxh3 are my custom implementation and go_map is the implementation based on Go's builtin map. If you look closely at the results, in most cases the xxh3-based implementation is significantly faster, sometimes even twice as fast, with far fewer allocations per loop (for example, in the binary case with all unique values you can see 2384 allocations in the go-map-based implementation vs 53 in my custom implementation, along with a ~37% performance improvement over the go-map-based implementation).

But if we look at some other cases, for example the 100_unique_len_32-32_n_65535 and 100_unique_len_8-32_n_65535 cases (65535 binary strings of length 32, or of lengths between 8 and 32, with exactly 100 unique values among them), the go-map-based implementation is actually slightly more performant despite a few more allocations. The same thing happens with the Int32 memo table with only 100 unique values over 65535 values, but when we increase to 1000 or 5000 unique values my custom one does better. This seems to indicate that the builtin Go map is faster in cases with a lower cardinality of unique values, while my custom implementation is more performant at inserting new values. The exception is the Float64 case, where my custom implementation is faster across the board, even with only 100 unique values, which I attribute to xxh3 hashing smaller values faster.

TL;DR: In most cases the custom implementation is faster, but in some cases with a lower cardinality of unique values (specifically int32 and binary strings) the implementation built on Go's map can be more performant.
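
(For reference, the table above is the output of the package's Go benchmarks run with -benchmem, which is what adds the B/op and allocs/op columns alongside ns/op.)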

@zeroshade
Member Author

zeroshade commented Aug 11, 2021

@emkornfield Just to tack on here, another interesting view is a flame graph of the CPU profile for the BenchmarkMemoTableAllUnique case, benchmarking just the binary strings. The largest difference between the two is that the builtin-Go-map-based implementation uses a map[string]int to map strings to their memo index, whereas the custom implementation uses an Int32HashTable to map the hash of the string (computed with the custom hash implementation) to the memo index.

[flame graph comparing the CPU profiles of the go-map-based and xxh3-based memo table implementations]

Looking at the flame graph, you can see that a larger proportion of the CPU time for the builtin-map-based implementation is spent in the map itself (performing the hashes, accessing, growing, and allocating) versus adding the strings to the BinaryBuilder, while in the xxh3-based custom implementation a smaller proportion of the time is spent actually performing the hashing and the lookups/allocations. In the benchmarks I pass 0 when creating the new memo table to avoid pre-allocating, in order to make the comparison with the Go map implementation closer, since to my knowledge there's no way to pre-allocate a size for the builtin Go map. But if I change that and have it actually use reserve to pre-allocate space, the difference becomes even more pronounced.
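
For anyone following along, here's a stripped-down sketch of the structural difference being profiled. It's illustrative only: collision checks, null handling, and the real open-addressing Int32HashTable are omitted (a plain map keyed by the 64-bit hash stands in for it), and the BinaryBuilder appends are reduced to comments.

```go
package memo

// Builtin-map variant: the map keys directly on the string, so the runtime
// hashes and stores a copy of every key on top of the BinaryBuilder data.
type goMapMemo struct {
	index map[string]int
}

func (m *goMapMemo) getOrInsert(v string) int {
	if idx, ok := m.index[v]; ok {
		return idx
	}
	idx := len(m.index)
	m.index[v] = idx
	// ...append v to the BinaryBuilder here...
	return idx
}

// Custom variant: hash the raw bytes once with xxh3 and key the table on that
// 64-bit hash; the string bytes themselves only land in the BinaryBuilder.
type xxh3Memo struct {
	table map[uint64]int
	hash  func([]byte) uint64
}

func (m *xxh3Memo) getOrInsert(v []byte) int {
	h := m.hash(v)
	if idx, ok := m.table[h]; ok {
		// the real table also compares stored values to guard against collisions
		return idx
	}
	idx := len(m.table)
	m.table[h] = idx
	// ...append v to the BinaryBuilder here...
	return idx
}
```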

@zeroshade
Member Author

Just to chime in with one last piece on this: while it's extremely interesting that a custom hash table outperforms Go's builtin map in many cases, remember that in absolute terms we're still talking about differences between 1ms and 0.1ms. Unless you're using it in a tight loop with a ton of entries/lookups, you're probably better off with Go's builtin map, simply because it's a simpler, built-in implementation with no external dependencies and will be more than sufficiently performant in most cases. But for this low-level handling of dictionary encoding in Parquet, the performance difference becomes significant at the scale this will be used, which makes the custom implementation preferable.

@emkornfield
Contributor

+1 merging. Thank you @zeroshade

asfgit pushed a commit that referenced this pull request Sep 13, 2021
Here's the next chunk of code following the merging of #10716

Thankfully the metadata package is actually relatively small compared to everything else so far.

@emkornfield @sbinet @nickpoorman for visibility

Closes #10951 from zeroshade/parquet-metadata-package

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
@zeroshade deleted the parquet-encoding-p2 branch September 27, 2021 17:14